Parallelizing Software-Implemented Error Detection

Authors

  • Ute Schiffel
  • André Schmitt
  • Martin Süßkraut
  • Stefan Weigert
  • Christof Fetzer
Abstract

Because of economic pressure, commodity hardware with insufficient error detection is increasingly used in critical applications. Moreover, commodity hardware is expected to become less reliable because of continuously decreasing feature sizes. Thus, we expect that software-implemented approaches to dealing with unreliable hardware will be needed. Arithmetic codes are well suited for this purpose because they can provide very good error detection capabilities independent of the actual failure modes of the underlying hardware. However, arithmetic codes cause high slowdowns. First, this paper describes our encoding, which uses an expensive AN-code. Second, we show how we harness the power of modern multicore CPUs to parallelize this expensive but flexible and powerful software-implemented fault detection technique. Our measurements show that under continuous probabilistic error injection, AN-encoding reduces the share of runs with incorrect output from 15.9% for the unencoded execution to 0.5% in the encoded case. Our parallelization reduces the observed slowdowns by an order of magnitude.
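
To make the idea of an AN-code concrete, the following is a minimal sketch and not the authors' implementation. It assumes the textbook form of the code, in which a value x is stored as A * x and any value that is not a multiple of A is treated as corrupted. The constant A and all function names are illustrative assumptions that do not come from the paper.

```python
# Minimal AN-code sketch (illustrative assumption, not the paper's encoder).
# A value x is represented by the code word A * x. Valid code words are
# exactly the multiples of A, so a check is a single modulo operation.

A = 58659  # hypothetical odd constant chosen only for this example


def encode(x: int) -> int:
    """Return the AN-encoded form of x."""
    return x * A


def is_valid(xc: int) -> bool:
    """A code word is valid only if it is a multiple of A."""
    return xc % A == 0


def decode(xc: int) -> int:
    """Check the code word and return the original value."""
    if not is_valid(xc):
        raise ValueError("AN-code check failed: not a multiple of A")
    return xc // A


def add(xc: int, yc: int) -> int:
    """Addition preserves the code: A*x + A*y == A*(x + y)."""
    return xc + yc


def mul(xc: int, yc: int) -> int:
    """Multiplication needs a correction: (A*x) * (A*y) // A == A*(x*y)."""
    return (xc * yc) // A


def flip_bit(value: int, bit: int) -> int:
    """Simulate a hardware fault by flipping one bit of a stored value."""
    return value ^ (1 << bit)


if __name__ == "__main__":
    total = add(encode(20), encode(22))
    print(decode(total))

    corrupted = flip_bit(total, 3)
    print(is_valid(corrupted))
```

A single bit flip changes a code word by plus or minus a power of two, and an odd A greater than 1 never divides a power of two, so the corrupted value in the sketch fails the check. The extra multiplication, division, and modulo operations on every value also hint at why such encodings are slow without further optimization.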

Similar resources

An Evaluation of the Error Detection Mechanisms in MARS Using Software-Implemented Fault Injection

The concept of fail-silent nodes greatly simplifies the design and safety proof of highly dependable fault-tolerant computer systems. The MAintainable Real-Time System (MARS) is a computer system where the hardware, operating system, and application-level error detection mechanisms are designed to ensure the fail silence of nodes with a high probability. The goal of this paper is twofold. First, the e...

The FTMPS-Project: Design and Implementation of Fault-Tolerance Techniques for Massively Parallel Systems

The FTMPS-project provides a solution to the need for fault tolerance in large systems. A complete fault-tolerance approach is developed and being implemented. The built-in hardware error-detection features combined with software error-detection techniques provide a high coverage of transient as well as permanent failures. Combined with the diagnosis software, the necessary information for t...

KUDA: GPU Accelerated Split Race Checker

We propose a novel approach for runtime verification on computers with a large number of computation cores, without any hardware extension to the mainstream PC environment. The goal of the approach is to use all hardware resources to decouple the computational overhead of traditional race checkers by parallelizing the runtime verification. We distinguish between two kinds of computational o...

Publication date: 2009